1 Themes

  1. The reproducibility crisis
  2. What does reproducible science look like?
  3. Using R & RStudio for reproducible science
  4. Advanced topics in R-eproducible science
  5. The future of reproducible psychological science

2 The reproducibility crisis

I like to start off talks about reproducibility in science with some humor. This video is a few years old, but it has some timeless insights.

What’s the point? That even the most well-meaning of us can make careless errors that undermine the reproducibility of science.

But, is it a crisis?

In 2016, Nature published the results of a survey of 1,500 scientists (Baker 2016). They were asked a number of questions, including the following:

2.1 Is there a reproducibility crisis?

  • Yes, a significant crisis
  • Yes, a slight crisis
  • No crisis
  • Don’t know

2.1.1 Results from our own bootcamp poll

2.2 Problems with reproducibility extend beyond psychology and behavioral science

2.3 Why is reproducibility hard?

2.3.1 A manifesto for reproducible science

(Munafò et al. 2017)

3 What does reproducible science look like?

3.1 What do we mean by ‘reproducibility’?

(Goodman et al. 2016)

  • Methods reproducibility
    • Enough details about materials & methods recorded (& reported)
    • Same results with same materials & methods
  • Results reproducibility
    • Same results from independent study
  • Inferential reproducibility
    • Same inferences from one or more studies or reanalyses

3.2 Achieving methods reproducibility

  • My own workflow
    • Data collection
    • Cleaning
    • Visualization
    • Analysis
    • Reporting
    • Manuscript generation?
  • Avoiding the ‘hit by a truck’ scenario

Can someone pick up where you left off without significant loss of time and momentum?

3.3 Reproducible workflows

  • Scripted, automated = minimize human-dependent steps.
  • Well-documented
  • Be kind to your future (forgetful) self
  • Transparent to me & colleagues == transparent to others

4 Using R (and RStudio) for reproducible science

4.1 Scripting

Think of each step of your data workflow

# Import/gather data

# Clean data

# Visualize data

# Analyze data

# Report findings

Imagine writing R code to handle each step.

# Import data
my_data <- read.csv("path/2/data_file.csv")

# Clean data
my_data$gender <- tolower(my_data$gender) # make lower case
...

You could put all the code in one script file, or even better, have separate scripts for each step that you source() one by one.

# Import data
source("R/Import_data.R") # source() runs scripts, loads functions

# Clean data
source("R/Clean_data.R")

# Visualize data
source("R/Visualize_data.R")
...
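A “master” script along these lines might look like the minimal sketch below. The three step scripts and their contents are hypothetical stand-ins, written to a temporary directory only so the example runs on its own; in a real project they would already live in your R/ folder.

```r
# Minimal sketch of a "master" script. The step scripts below are
# hypothetical placeholders created in a temporary directory for illustration.
script_dir <- file.path(tempdir(), "R")
dir.create(script_dir, showWarnings = FALSE, recursive = TRUE)

writeLines(
  'my_data <- data.frame(gender = c("Female", "MALE", "female"), score = c(10, 12, 9))',
  file.path(script_dir, "Import_data.R")
)
writeLines(
  'my_data$gender <- tolower(my_data$gender) # make lower case',
  file.path(script_dir, "Clean_data.R")
)
writeLines(
  'mean_by_gender <- aggregate(score ~ gender, data = my_data, FUN = mean)',
  file.path(script_dir, "Analyze_data.R")
)

# Run each step in order; later scripts see objects created by earlier ones
steps <- c("Import_data.R", "Clean_data.R", "Analyze_data.R")
for (step in steps) {
  message("Running ", step)
  source(file.path(script_dir, step))
}

print(mean_by_gender)
```

Because every step is a file, fixing an error in the raw data only requires re-running this one script.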

4.1.1 Scripts: Strengths & Weaknesses

  • R commands in files that can be re-run
  • Separate pieces of workflow kept separate
  • “Master.R” script that can be run to regenerate full sequence of results
    • Error in raw data file?
    • No problem; fix data file and re-run “Master.R”
  • How to save results or share with collaborators?

4.2 RStudio projects

RStudio has a ‘projects’ function that I strongly recommend you use.

  • Create using ‘File/New Project’
  • Store in new directory with sensible name (‘projects/1st_year_proj’)
  • Creates *.Rproj file
  • Turn off saving data to the .RData file
  • ‘File/Open Project…’ command opens ‘fresh’ workspace

Here’s an example of some recent projects I’ve worked on.

Using RStudio projects helps keep your files and settings organized. It’s easy to switch between projects. It reduces mental effort (what directory am I in?), and especially avoids having to use directory-setting commands like setwd() that will only work on your computer. RStudio projects also integrate with version control (e.g., GitHub).
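To make the point about setwd() concrete, here is a small sketch that contrasts an absolute path with a project-relative one. The folder layout and file name are hypothetical, and a temporary folder stands in for the project directory. The single setwd() call is there only to imitate what opening the .Rproj file does for you.

```r
# A temporary folder stands in for a project directory (hypothetical layout)
project_root <- file.path(tempdir(), "first_year_proj")
dir.create(file.path(project_root, "data"), recursive = TRUE, showWarnings = FALSE)

# Opening the .Rproj file makes the project folder the working directory.
# setwd() is used here ONLY to simulate that step so the sketch runs anywhere.
old_wd <- setwd(project_root)

# Fragile: an absolute path that exists on one machine only
# my_data <- read.csv("/Users/me/projects/first_year_proj/data/survey.csv")

# Portable: a path relative to the project root
data_path <- file.path("data", "survey.csv")
write.csv(data.frame(id = 1:3, score = c(4, 5, 3)), data_path, row.names = FALSE)
my_data <- read.csv(data_path)

setwd(old_wd) # restore the original working directory
```

A collaborator who opens the same project gets the same working directory, so the relative path keeps working on their machine.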

4.3 R Markdown

  • Combine text, R code, images, comments, videos into one document
  • Render as web page (or site), PDF, MS Word (.docx), slides, etc.

4.3.1 What is R Markdown?

  • Markdown is a markup ‘language’ used in lots of blogging engines and wikis
    • Simple commands to format documents
    • Designed to be easier than writing in raw HTML
  • Human & machine readable
  • R Markdown extends Markdown with commands specialized for R code
  • Tutorial by Hadley Wickham
  • Cheatsheets

# Big idea

## Smaller idea in service of bigger

- Supporting point
- Another supporting point

1. an enumerated **bold** point
1. an enumerated *italicized* point
- a [link](http://psu-psychology.github.io/r-bootcamp) to this bootcamp
- an image: ![rawr](https://www.insidehighered.com/sites/default/server_files/media/PennState2.PNG)
- an equation: $e = mc^2$
# Some R code
ggplot2::qplot(rnorm(100))

# Some R code
ggplot2::qplot(rnorm(100))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

4.3.2 One file, many possible outputs

- [pdf_document](http://rmarkdown.rstudio.com/pdf_document_format.html), [word_document](http://rmarkdown.rstudio.com/word_document_format.html), or [github_document](http://rmarkdown.rstudio.com/github_document_format.html)
- [ioslides_presentation](http://rmarkdown.rstudio.com/ioslides_presentation_format.html) for HTML slide show
- Cool interactive web-apps in Shiny
- Web sites like the one for this [bootcamp](https://github.com/psu-psychology/r-bootcamp-2018), [blogs](https://bookdown.org/yihui/blogdown/), even [books](https://bookdown.org/yihui/bookdown/)

4.3.3 Example using 2018 R bootcamp data

I’ve analyzed the survey data you provided using an R Markdown document bootcamp-survey.Rmd. Let’s open it up and see how it looks.

The default format is an html_document but I can easily produce outputs in different formats simply by changing a parameter.

rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = "pdf_document")
rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = "word_document")
rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = "ioslides_presentation")
rmarkdown::render('talks/bootcamp-survey.Rmd', output_format = c("pdf_document", "word_document", "github_document", "ioslides_presentation"))

So, I can prepare one document but many different output formats. Your adviser likes PDF? No problem. Your collaborator prefers MS Word? Got it covered. Need to give a quick brown bag talk you can give from any web browser? Easy.
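If you want to try this without the bootcamp files, the sketch below writes a tiny stand-in document and renders it to two formats. The file name, its contents, and the helper render_formats() are made up for illustration. The helper skips rendering when rmarkdown or pandoc is missing, and PDF is left out because it also needs a LaTeX installation.

```r
# Minimal sketch: render one R Markdown source to several formats.
# The document below is a hypothetical stand-in, not bootcamp-survey.Rmd.
rmd_file <- file.path(tempdir(), "example-report.Rmd")
writeLines(c(
  "---",
  "title: \"Example report\"",
  "---",
  "",
  "The mean of 1, 2, 3 is `r mean(1:3)`."
), rmd_file)

# Hypothetical helper: render `input` once per requested format
render_formats <- function(input, formats) {
  # Skip gracefully when rmarkdown or pandoc is not installed
  if (!requireNamespace("rmarkdown", quietly = TRUE) ||
      !rmarkdown::pandoc_available()) {
    message("rmarkdown or pandoc not available; nothing rendered")
    return(character(0))
  }
  # render() returns the path of each output file
  vapply(formats, function(fmt) {
    rmarkdown::render(input, output_format = fmt, quiet = TRUE)
  }, character(1))
}

outputs <- render_formats(rmd_file, c("html_document", "word_document"))
print(outputs)
```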

5 Advanced topics in R-eproducible science

5.1 Why write reproducible papers? (Frank & Hartgerink)

The following section is copied verbatim from Mike Frank & Chris Hartgerink’s tutorial on GitHub.

There are three reasons to write reproducible papers. To be right, to be reproducible, and to be efficient. There are more, but these are convincing to us. In more depth:

To avoid errors. Using an automated method for scraping APA-formatted stats out of PDFs, (Nuijten et al. 2015) found that over 10% of p-values in published papers were inconsistent with the reported details of the statistical test, and 1.6% were what they called “grossly” inconsistent, e.g. difference between the p-value and the test statistic meant that one implied statistical significance and the other did not. Nearly half of all papers had errors in them.

To promote computational reproducibility. Computational reproducibility means that other people can take your data and get the same numbers that are in your paper. Even if you don’t have errors, it can still be very hard to recover the numbers from published papers because of ambiguities in analysis. Creating a document that literally specifies where all the numbers come from in terms of code that operates over the data removes all this ambiguity.

To create spiffy documents that can be revised easily. This is actually a really big neglected one for us. At least one of us used to tweak tables and figures by hand constantly, leading to a major incentive never to rerun analyses because it would mean re-pasting and re-illustratoring all the numbers and figures in a paper. That’s a bad thing! It means you have an incentive to be lazy and to avoid redoing your stuff. And you waste tons of time when you do. In contrast, with a reproducible document, you can just rerun with a tweak to the code. You can even specify what you want the figures and tables to look like before you’re done with all the data collection (e.g., for purposes of preregistration or a registered report).
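As a rough illustration of these points, the sketch below builds the reported statistics directly from a fitted test object, then recomputes the p-value from the test statistic and degrees of freedom as a consistency check. The data are simulated and the variable names are hypothetical; this is not the method used by Nuijten et al., only the same kind of check in miniature.

```r
# Simulated, hypothetical data for two groups
set.seed(42)
group_a <- rnorm(30, mean = 0)
group_b <- rnorm(30, mean = 0.5)

# Fit the test once; every reported number comes from this object
fit <- t.test(group_a, group_b, var.equal = TRUE)

t_value <- unname(fit$statistic)
df <- unname(fit$parameter)
p_value <- fit$p.value

# Build the reporting string from code instead of typing numbers by hand
report <- sprintf("t(%d) = %.2f, p = %.3f", as.integer(df), t_value, p_value)
print(report)

# Consistency check: recompute the two-sided p-value from t and df
p_check <- 2 * pt(-abs(t_value), df = df)
stopifnot(isTRUE(all.equal(p_check, p_value)))
```

In an R Markdown document the same values could be dropped into the text with inline code, so the prose and the analysis cannot drift apart.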

5.1.1 Example of a reproducible paper using the papaja package

It’s possible to write a paper like this from an R Markdown document that looks like this. Let’s peek under the hood just a bit.

So, there’s much more to say about how to do this than we have time for today. This guide or this are good places to start. But I think we can all agree that pushing a button to render a complete paper, including tables, figures, and references, is pretty amazing.

5.2 Version control

Track changes is great, right? But if you’ve ever written a lengthy document with other people, you’ve experienced the challenge of tracking versions across time. At some point, the changes become too extensive to track, and so the author(s) decide to accept or reject a bunch and create a new version. This is how version control becomes an extension of the track changes problem. Most of us have experienced something like this sequence: ‘paper.docx’, ‘paper_new.docx’, ‘paper_new_new.docx’, ‘paper_new_new_ROG.docx’, etc.

My current scheme with colleagues is something like this: ‘nsf_grant_2018-08-16v1.docx’, ‘nsf_grant_2018-08-16v2.docx’, etc. That is, each person who modifies the document saves it as a new version. It doesn’t avoid conflicts if we’re working in parallel, but it does help us track down where we went astray.

Imagine a scheme for doing this automatically with your R and RStudio files. RStudio incorporates two ‘version control’ systems from the software development world, ‘git’ and ‘subversion’. I use ‘git’ and a web-based service for managing projects that use git called GitHub.

5.2.1 Rick’s GitHub workflow

We don’t have time to go into git and GitHub here, but I strongly recommend Jenny Bryan’s tutorial Happy Git and GitHub for the useR. In the meantime, this is the workflow I use for almost every project I do that will involve R:

  1. Create a repo on GitHub
  2. Copy repo URL
  3. File/New Project.../
  4. Version Control, Git
  5. Paste repo URL
  6. Select local name for repo and directory where it lives.
  7. Open project within RStudio File/Open Project...
  8. Commit (upload a commented version) early & often

These videos show this workflow in action.


This way, I always have local and web-based copies of my latest work. It’s easy to share with collaborators. I just send them a URL to the project. This is how Michael and I worked together on the bootcamp. And it’s how I create all of my teaching and talk slides.

5.3 Web sites with R Markdown

When I say most of my work these days is in RStudio, I’m not kidding. It’s easy to create simple websites that start out as R Markdown documents. The bootcamp’s website is an example.

If you’re curious about this, please feel free to examine the ‘guts’ of the bootcamp’s repository. It includes a _site.yml file that contains site configuration parameters, an index.Rmd home page for the site and other *.Rmd files that get converted into pages, and directories for files.

To create the site, I simply enter rmarkdown::render_site() from the console, commit the changes and push them to GitHub. GitHub has its own web hosting service called GitHub Pages that makes it easy to create and modify simple websites.
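For a sense of how little is needed, the sketch below sets up a toy site in a temporary folder and renders it. The site name, navbar entries, and page text are placeholders, not the bootcamp’s actual configuration. Rendering is skipped when rmarkdown or pandoc is not installed.

```r
# Minimal sketch of an R Markdown website; all names here are hypothetical
site_dir <- file.path(tempdir(), "example_site")
dir.create(site_dir, showWarnings = FALSE)

# Site-wide configuration
writeLines(c(
  "name: \"example-site\"",
  "navbar:",
  "  title: \"Example site\"",
  "  left:",
  "    - text: \"Home\"",
  "      href: index.html"
), file.path(site_dir, "_site.yml"))

# Home page
writeLines(c(
  "---",
  "title: \"Home\"",
  "---",
  "",
  "Welcome to the example site."
), file.path(site_dir, "index.Rmd"))

# Render every .Rmd file in the folder into a website
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render_site(site_dir, quiet = TRUE)
}

print(list.files(site_dir))
```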

6 The future of reproducible psychological science…

I’m very optimistic about the future of psychological science. Our science is harder than physics, chemistry, and engineering because it incorporates phenomena from all of them. And the survey data show that failures to replicate in these other fields are common, more common than most outsiders would imagine. Behavioral scientists are becoming more mindful of the challenges to robustness in our work and are making strides to bolster it.

6.1 Let’s learn more, faster

At the end of the day, by making our research more reproducible, our findings will be more robust, and by sharing our findings, displays, data, and analysis code more widely and openly, we’ll all learn more, faster. That’s Databrary’s motto, by the way.

So, I think that in the very near future we’ll see the following:

  • transparent, reproducible, open workflows across the publication cycle
  • Openly shared materials + data + code
  • (Munafò et al. 2017): reproducible practices across the workflow
  • Data, materials (displays, code) stored in central repository
  • (Gilmore and Adolph 2017): video and reproducible behavioral science
  • Access/analyze materials/displays via repository application program interfaces (APIs)

These will help build a ‘cumulative’ science (Mischel 2011).

6.2 Learn from my mistakes

I still consider myself a ‘student of R’. I learn new tricks and techniques all the time. You can teach an old dog new tricks. He just has to be motivated. It takes longer, and you have to be patient.

But you can learn from my mistakes:

  • Script everything you possibly can
    • If you have to repeat something, make a function or write a parameterized script
  • Document all the time
    • Comments in code
    • Update README files
  • Don’t be afraid to ask
  • Don’t be afraid to work in the open
  • Learn from others
  • Just do it!
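As a small example of the “make a function” advice, the sketch below replaces copy-pasted cleaning code with one function applied to several columns. The data frame and the helper name clean_text() are hypothetical.

```r
# Hypothetical survey data with inconsistent capitalization and spacing
survey <- data.frame(
  gender = c(" Female", "MALE ", "female"),
  handedness = c("Right", " LEFT", "right "),
  stringsAsFactors = FALSE
)

# One function replaces cleaning code that would otherwise be copied per column
clean_text <- function(x) {
  tolower(trimws(x)) # strip surrounding whitespace, then make lower case
}

# Apply the same cleaning step to every text column
text_cols <- c("gender", "handedness")
survey[text_cols] <- lapply(survey[text_cols], clean_text)

print(survey)
```

If the cleaning rule changes later, it changes in one place rather than in every copy.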

In many ways, learning R is like acquiring a super power.

So, go be super-powerful. But remember that with great power comes great responsibility.

7 Materials

This talk was produced on 2018-07-26 17:01:33 in RStudio 1.1.453 using R Markdown. The code and materials used to generate the slides may be found at https://github.com/psu-psychology/r-bootcamp-2018/. Information about the R Session that produced the slides is as follows:

sessionInfo()
## R version 3.5.1 (2018-07-02)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Sierra 10.12.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] forcats_0.3.0   stringr_1.3.1   dplyr_0.7.6     purrr_0.2.5    
## [5] readr_1.1.1     tidyr_0.8.1     tibble_1.4.2    ggplot2_3.0.0  
## [9] tidyverse_1.2.1
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.18     cellranger_1.1.0 pillar_1.3.0     compiler_3.5.1  
##  [5] plyr_1.8.4       bindr_0.1.1      tools_3.5.1      digest_0.6.15   
##  [9] lubridate_1.7.4  jsonlite_1.5     evaluate_0.11    nlme_3.1-137    
## [13] gtable_0.2.0     lattice_0.20-35  pkgconfig_2.0.1  rlang_0.2.1     
## [17] cli_1.0.0        rstudioapi_0.7   yaml_2.1.19      haven_1.1.2     
## [21] bindrcpp_0.2.2   withr_2.1.2      xml2_1.2.0       httr_1.3.1      
## [25] knitr_1.20       hms_0.4.2        rprojroot_1.3-2  grid_3.5.1      
## [29] tidyselect_0.2.4 glue_1.3.0       R6_2.2.2         readxl_1.1.0    
## [33] rmarkdown_1.10   modelr_0.1.2     magrittr_1.5     backports_1.1.2 
## [37] scales_0.5.0     htmltools_0.3.6  rvest_0.3.2      assertthat_0.2.0
## [41] colorspace_1.3-2 stringi_1.2.4    lazyeval_0.2.1   munsell_0.5.0   
## [45] broom_0.5.0      crayon_1.3.4
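One way to keep a record like this alongside your own project is to write the session information to a text file each time the analysis runs. The sketch below does that; the file name is an arbitrary choice, and a temporary directory is used here only to keep the example self-contained.

```r
# Save the session information next to the analysis outputs
session_file <- file.path(tempdir(), "session_info.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Peek at the first few lines of the saved record
print(readLines(session_file, n = 3))
```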

References

Gilmore, R O, and K E Adolph. 2017. “Video Can Make Behavioural Science More Reproducible.” Nature Human Behaviour 1 (June). doi:10.1038/s41562-017-0128.

Mischel, W. 2011. “Becoming a Cumulative Science.” APS Observer 22 (1). https://www.psychologicalscience.org/observer/becoming-a-cumulative-science.

Munafò, MR, BA Nosek, DVM Bishop, KS Button, CD Chambers, NP du Sert, U Simonsohn, E-J Wagenmakers, JJ Ware, and JPA Ioannidis. 2017. “A Manifesto for Reproducible Science.” Nature Human Behaviour 1 (January): 0021. doi:10.1038/s41562-016-0021.

Nuijten, MB, CHJ Hartgerink, MALM van Assen, S Epskamp, and JM Wicherts. 2015. “The Prevalence of Statistical Reporting Errors in Psychology (1985–2013).” Behavior Research Methods, October, 1–22. doi:10.3758/s13428-015-0664-2.